For this exercise, we will be using pandas, NumPy, and matplotlib:
In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
plt.style.use('ggplot')
%matplotlib inline
In the /data/ folder, you will find a series of .json files called dataN.json, numbered 1-4. Each file contains the following data, but each stores it in a different JSON orientation:
|   | birthday | first_name | last_name |
|---|----------|------------|-----------|
| 0 | 5/3/67   | Robert     | Hernandez |
| 1 | 8/4/84   | Steve      | Smith     |
| 2 | 9/13/91  | Anne       | Raps      |
| 3 | 4/15/75  | Alice      | Muller    |
In [47]:
#Your code here...
#Each file stores the same records in a different JSON orientation
file1 = pd.read_json('../../data/data1.json')
file2 = pd.read_json('../../data/data2.json')
file3 = pd.read_json('../../data/data3.json', orient='columns')
file4 = pd.read_json('../../data/data4.json', orient='split')

#data2.json reads back transposed, so flip it before concatenating
combined = pd.concat([file1, file2.T, file3, file4], ignore_index=True)
combined
Out[47]:
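Each orient tells read_json how the JSON maps onto rows and columns. As a self-contained illustration (synthetic one-row JSON strings, not the actual data files), the cell below sketches the layouts used above:
In [ ]:
import io

#orient='columns' (the default): top-level keys are column names
columns_style = '{"birthday":{"0":"5/3/67"},"first_name":{"0":"Robert"}}'
print(pd.read_json(io.StringIO(columns_style)))

#orient='split': separate keys for columns, index, and data
split_style = '{"columns":["birthday","first_name"],"index":[0],"data":[["5/3/67","Robert"]]}'
print(pd.read_json(io.StringIO(split_style), orient='split'))

#A file written with orient='index' reads back transposed under the
#default orient, which is why file2 needs .T above
index_style = '{"0":{"birthday":"5/3/67","first_name":"Robert"}}'
print(pd.read_json(io.StringIO(index_style)).T)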
In the data folder, there is a webserver log file called hackers-access.httpd. For this exercise, you will use this file to determine the most common operating systems and browsers used to access the server.
In order to accomplish this task, parse each log entry with the apache_log_parser module, then extract the operating system and browser from each user-agent string with the user_agents module, the documentation for which is available here: https://pypi.python.org/pypi/user-agents
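Before applying it to the whole log, it helps to see what user_agents returns for a single string. A quick sketch (a sample user-agent, not one taken from the log; the exact family names depend on the installed ua-parser data):
In [ ]:
from user_agents import parse

ua = parse('Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 '
           '(KHTML, like Gecko) Chrome/41.0.2272.76 Safari/537.36')
print(str(ua))            #e.g. "PC / Windows 7 / Chrome 41.0.2272"
print(ua.os.family)       #structured access to the operating system
print(ua.browser.family)  #structured access to the browser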
In [48]:
import apache_log_parser
from user_agents import parse

#str(UserAgent) looks like "PC / Windows 7 / Chrome 41.0.2272":
#the second field is the operating system, the third is the browser
def parse_ua(line):
    parsed_data = parse(line)
    return str(parsed_data).split('/')[1].strip()

def parse_ua_2(line):
    parsed_data = parse(line)
    return str(parsed_data).split('/')[2].strip()

#Read in the log file
line_parser = apache_log_parser.make_parser("%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-agent}i\"")

parsed_server_data = []
with open("../../data/hackers-access.httpd", "r") as server_log:
    for line in server_log:
        parsed_server_data.append(line_parser(line))

server_df = pd.DataFrame(parsed_server_data)

#Apply the functions to the dataframe
server_df['OS'] = server_df['request_header_user_agent'].apply(parse_ua)
server_df['Browser'] = server_df['request_header_user_agent'].apply(parse_ua_2)

#Get the top 10 values
server_df['OS'].value_counts().head(10)
Out[48]:
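The OS counts above answer half the question; the same value_counts() pattern gives the top 10 browsers:
In [ ]:
#Get the top 10 browsers
server_df['Browser'].value_counts().head(10)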
Using the dailybots.csv file, read the data into a DataFrame and perform the following operations:
You will need the groupby() function, which is documented here: http://pandas.pydata.org/pandas-docs/stable/groupby.html
In [60]:
#Your code here...
bots = pd.read_csv('../../data/dailybots.csv')

#Filter to the Government/Politics industry, keeping only the
#columns we need
gov_bots = bots.loc[bots['industry'] == 'Government/Politics',
                    ['botfam', 'hosts']]

#Count the number of records for each bot family
gov_bots.groupby('botfam').size()
Out[60]:
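Note that size() counts rows per group rather than summing hosts. If the goal is the total number of hosts seen per bot family, a sum over the hosts column is the natural aggregation (a sketch over the same gov_bots frame):
In [ ]:
#Total hosts per bot family, largest first
gov_bots.groupby('botfam')['hosts'].sum().sort_values(ascending=False)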